Before using R to illustrate basic programming concepts and data analysis tools, we will get familiar with the RStudio layout.
RStudio has four primary panels that will help you interact with your data. We will use the default layout of these panels.
Source panel: Top left
Edit files to create ‘scripts’ of code
Console panel: Bottom left
Accepts code as input
Displays output when we run code
Environment panel: Top right
Everything that R is holding in memory
Objects that you create in the console or source panels will appear here
You can clear the environment with the broom icon
Viewer panel: Bottom right
View graphics that you generate
Navigate files
Illustration
Let’s use these panels to create and interact with data.
Console:
Perform a calculation: type 2 + 2 into the console panel and hit ENTER
Create and store an object: type sum = 2 + 2 into the console panel and hit ENTER
Source:
Start an R script: Open a new .R file (button in the top left, below “File”)
Create and store an object: type sum = 2 + 3 into the source panel and hit CTRL+ENTER
Environment:
Confirm that the object sum is stored in our environment
Use rm(sum) to clear the object from the environment
Clear the environment with the broom icon
Viewer:
Navigate through your computer’s files
Create a plot in the source panel
Review of Basic Programming Concepts
Now that we understand the layout, we are ready to review the concepts covered in Module 2 Week 2.2. These concepts will help us understand what is happening when we create and manipulate data.
Objects: where values are saved in R
“Object” is a generic term for anything that R stores in the environment. This can include anything from an individual number or word, to lists of values, to entire datasets.
Importantly, objects belong to different “classes” depending on the type of values that they store.
Numerics are numbers
Characters are text or strings like "hello world" and "welcome to R".
Factors are a group of characters/strings with a fixed number of unique values
Logicals are either TRUE or FALSE
# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)
[1] "numeric"
# Create a character object
my_character = "welcome to R"
# Check the class
class(my_character)
[1] "character"
# Create a logical object
my_logical = TRUE
# Check the class
class(my_logical)
[1] "logical"
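The list above also mentions factors. As a quick sketch (the variable name here is illustrative), a factor can be created from a group of characters with factor(), which records the fixed set of unique values as "levels":

```r
# Create a factor from a character vector
my_factor = factor(c("low", "high", "low", "medium"))
# Check the class
class(my_factor)
# See the fixed set of unique values ("levels")
levels(my_factor)
```

Here class() returns "factor", and levels() returns the unique values "high", "low", and "medium" (sorted alphabetically by default).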
R can perform operations on objects.
# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)
[1] "numeric"
# Perform a calculation
my_number = my_number + 5
The class of an object determines the type of operations you can perform on it. Some operations can only be run on numeric objects (numbers).
# Create a character object
my_number = "5.6"
# Check the class
class(my_number)
# Perform a calculation (these lines will produce errors)
my_number + 5
round(my_number)
R contains functions that can convert some objects to different classes.
# Convert character to numeric
my_number = as.numeric("5")
class(my_number)
[1] "numeric"
my_number <- 5

# But R is only so smart
my_number = as.numeric("five")
print(my_number)
[1] NA
Data Structures
The simplest objects are single values, but most data analysis involves more complicated data structures.
Vectors
Vectors are a type of data structure that store multiple values of the same class together. Vectors are created using c() and allow you to perform operations on a series of values.
# Create a numeric vector
numeric_vector = c(6, 11, 13, 31)
# Print the vector
print(numeric_vector)
[1] 6 11 13 31
# Check the class
class(numeric_vector)
[1] "numeric"
# Calculate the mean
mean(numeric_vector)
[1] 15.25
An important part of working with more complex data structures is called “indexing.” Indexing allows you to extract specific values from a data structure.
# Extract the 2nd element from the vector
numeric_vector[2]
[1] 11
# Extract elements 2-4
numeric_vector[2:4]
[1] 11 13 31
# Extract elements 1-2 using logical indexing
numeric_vector[c(TRUE, TRUE, FALSE, FALSE)]
[1] 6 11
Dataframes
Data frames are the most common type of data structure used in research. Data frames combine multiple vectors of values into a single object.
# Create a dataframe
my_data = data.frame(
  x1 = rnorm(100, mean = 1, sd = 1),
  x2 = rnorm(100, mean = 1, sd = 1)
)
class(my_data)
[1] "data.frame"
Anything that comes in a spreadsheet (for example, an Excel file) can be loaded into an R environment as a dataframe. R works most easily when spreadsheets are saved as .csv files.
# Use `read.csv()` to load data from a website
dat = read.csv("https://raw.githubusercontent.com/jrspringman/psci3200-globaldev/main/workshops/aau_survey/clean_endline_did.csv")

# Use `read.csv()` to load data from your computer's Downloads folder
# dat = read.csv("/home/jeremy/Downloads/clean_endline_did.csv")
In most data frames, rows correspond to observations and columns correspond to variables that describe the observations. Here, we are looking at survey data from an RCT involving university students in Addis Ababa. Each row corresponds to a different survey respondent, and each column represents their answers to a different question from the survey.
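As a sketch of how to check this rows-and-columns structure yourself (using base R functions on the `dat` object loaded above):

```r
# Number of rows (respondents) and columns (survey questions)
dim(dat)
# Column (variable) names
names(dat)
# First few rows of the data
head(dat, 3)
```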
Loading Packages
Packages are an extremely important part of data analysis with R.
R gives you access to thousands of “packages” that are created by users
Packages contain bundles of code called “functions” that can execute specific tasks
Use install.packages() to install a package and library() to load a package
In the next section, we’ll use the package dplyr to perform some data cleaning. dplyr is part of a universe of packages called the tidyverse. Since this is one of the most important packages in the R ecosystem, let’s install and load it.
# install.packages("tidyverse")
library(tidyverse)
When you are searching online or asking ChatGPT how to perform a specific task in R, it often helps to specify that you are looking for a solution in dplyr.
Cleaning Data
In the real-world, data never comes ready to be analyzed. Data cleaning is the process of manipulating data so that it can be analyzed. This is usually the most difficult and time-consuming part of any data analysis project. Let’s walk through some examples.
Question: In your work, what data cleaning have you had to do? How have you cleaned your data?
Creating Variables
Imagine we want to analyze the relationship between whether a respondent moved to Addis Ababa to attend university and their level of political participation. However, there are two problems:
We don’t have a specific variable that measures whether or not respondents moved
We have many measures of participation
How can we create a variable measuring whether the respondent moved to Addis Ababa? We have a multiple-choice question asking students about what region they come from.
Let’s start by investigating this variable.
# The name of the variable in our dataframe is `q8_baseline`
table(dat$q8_baseline)
Addis Ababa                                          283
Afar Region                                            1
Amhara Region                                        174
Benishangul-Gumuz Region                               5
Dire Dawa (city)                                       5
Gambela Region                                         1
Harari Region                                          4
Oromia Region                                        205
Other                                                 12
Prefer not to say                                     19
Sidama Region                                         35
Somali Region                                          5
South West Ethiopia Peoples Region                     6
Southern Nations, Nationalities, and Peoples Region   65
Tigray Region                                          5
dat = dat %>% # this is called a "pipe"
  # give our variable a better name
  rename(home_region = q8_baseline)

dat = dat %>%
  # drop respondents who report "Prefer not to say"
  filter(home_region != "Prefer not to say")

dat = dat %>%
  # clean home region variable using `mutate()`
  mutate(
    # Shorten a long name to an abbreviation
    home_region = ifelse(home_region == "Southern Nations, Nationalities, and Peoples Region",
                         "SNNPR", home_region),
    # remove the word "Region" from every observation of this column
    home_region = str_remove(home_region, " Region"),
    home_region = str_remove(home_region, " \\(city\\)")
  )

# Check if it worked
table(dat$home_region)
Addis Ababa                  283
Afar                           1
Amhara                       174
Benishangul-Gumuz              5
Dire Dawa                      5
Gambela                        1
Harari                         4
Oromia                       205
Other                         12
Sidama                        35
SNNPR                         65
Somali                         5
South West Ethiopia Peoples    6
Tigray                         5
# Chain these all together for more concise code
dat = read_csv("https://raw.githubusercontent.com/jrspringman/psci3200-globaldev/main/workshops/aau_survey/clean_endline_did.csv") %>%
  # give our variable a better name
  rename(home_region = q8_baseline) %>%
  # drop respondents who report "Prefer not to say"
  filter(home_region != "Prefer not to say") %>%
  # clean home region variable
  mutate(
    # Shorten long names to abbreviations
    home_region = case_when(
      home_region == "Southern Nations, Nationalities, and Peoples Region" ~ "SNNPR",
      home_region == "South West Ethiopia Peoples Region" ~ "SWEPR",
      TRUE ~ home_region
    ),
    # remove the word "Region" (or "(city)") from every observation of this column
    home_region = str_remove(home_region, " Region| \\(city\\)")
  )

# Check if it worked
table(dat$home_region)
Addis Ababa          283
Afar                   1
Amhara               174
Benishangul-Gumuz      5
Dire Dawa              5
Gambela                1
Harari                 4
Oromia               205
Other                 12
Sidama                35
SNNPR                 65
Somali                 5
SWEPR                  6
Tigray                 5
Now that we’ve cleaned-up the names, we want to create a variable that tells us whether or not each respondent is originally from Addis Ababa. This will let us measure whether or not they moved to Addis in order to attend college.
# Create a measure of whether a respondent moved to Addis Ababa to attend university
dat = dat %>%
  mutate(
    moved = case_when(
      home_region == "Addis Ababa" ~ 0,
      TRUE ~ 1
    )
  )

table(dat$moved, dat$home_region)
Now, we need to create our second variable measuring levels of political participation. Remember, the challenge is that we have multiple measures of participation. Let’s start with two measures:
Number of times you’ve contacted gov’t official q13_4_1
Number of times you’ve signed a petition q13_5_1
# Check out the distribution of our variables
dat %>%
  select(q13_4_1, q13_5_1) %>%
  head(5)
# Create a single measure
dat = dat %>%
  mutate(add_participation = q13_4_1 + q13_5_1)

hist(dat$add_participation)
One thing we need to be careful with is NA values: how do we investigate them? We need to think carefully about why NA values are in our data and how to handle them appropriately.
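A few ways to investigate NA values, sketched with base R and the variables created above:

```r
# How many NA values are in our participation measure?
sum(is.na(dat$add_participation))
# Which of the component questions are missing, and how often?
colSums(is.na(dat[, c("q13_4_1", "q13_5_1")]))
# Include NAs when tabulating a variable
table(dat$add_participation, useNA = "always")
```

Note that `q13_4_1 + q13_5_1` returns NA whenever either component is NA, which is why the counts above matter for interpreting `add_participation`.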
Question: Thinking about the data that you have worked with, what are the most common sources of NA values?
## Find our two participation measures
add_ecols = grep("q13_4_1$|q13_5_1$", names(dat), value = T)

dat = dat %>%
  mutate(add_participation = rowSums(across(all_of(add_ecols))))
  # mutate(add_participation_end = rowSums(across(all_of(add_ecols)), na.rm = T))

# dat = dat %>%
#   mutate(home_region = na_if(home_region, "Prefer not to say"))
Merging Datasets
Often, data analysis projects will require you to use variables from more than one dataset. This will require you to combine separate datasets into a single dataset in R, which is called merging. A merge uses an identifier (administrative unit names, respondent IDs, etc.) that is present in both datasets to combine them into one.
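As a minimal sketch of how a merge works (the toy dataframes and the `id` column below are made up for illustration), dplyr's join functions match rows on the shared identifier:

```r
library(dplyr)

# Two small datasets that share an identifier column `id`
ages    = data.frame(id = c(1, 2, 3), age  = c(34, 28, 51))
regions = data.frame(id = c(1, 2, 4), home = c("Oromia", "Amhara", "Sidama"))

# full_join keeps every row from both datasets, filling NA where ids don't match
full_join(ages, regions, by = "id")

# left_join keeps only the rows of the first dataset
left_join(ages, regions, by = "id")
```

Here full_join returns four rows (ids 1-4, with NAs for the unmatched ids 3 and 4), while left_join returns three rows (ids 1-3).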
Question: Thinking about the data that you have worked with, have you ever had to merge multiple datasets? What was the data? How did you do it?
## Here we are selecting the unit-identifier (response_id) and the two variables
## we have created and saving them to a separate dataframe
new_vars = dat %>%
  select(response_id, moved, add_participation)

## After you do some cleaning or create new variables, you may want to save that dataframe for future use
# write.csv(new_vars, here::here("workshops/aau_survey/new_vars.csv"))

## Drop the new variables from our dataframe so that we can merge them back in
dat = dat %>%
  select(-moved, -add_participation)

## Check to make sure the merge will work as expected
# nrow(dat)
# nrow(new_vars)
# dat$response_id[! dat$response_id %in% new_vars$response_id]

## Use dplyr's `full_join` function to merge datasets
dat = dat %>%
  full_join(., new_vars)
Data Visualization
One of the major benefits of using R is the ability to create beautiful tables and figures that can be incorporated into deliverables and other professional documents. Importantly, you can create a script that allows you to automatically recreate the same visualizations when you want to make slight changes or incorporate new data.
Question: What methods do you currently use to create data visualizations for deliverables and other research products?
Tables
# install.packages("gt")
library(gt)

descriptives = dat %>%
  group_by(moved) %>%
  summarise(
    observations = n(),
    share = observations / nrow(dat),
    mean = mean(add_participation, na.rm = T)
  ) %>%
  mutate(
    moved = case_when(moved == 0 ~ "No", moved == 1 ~ "Yes"),
    share = share * 100
  )

gt(descriptives) %>%
  tab_header(title = "Moving to University and Political Participation") %>%
  fmt_number(
    columns = 3:4,
    decimals = 2,
    use_seps = FALSE
  ) %>%
  cols_label(
    moved = md("Moved to<br>Addis"),
    observations = md("Respondents<br>(count)"),
    share = md("Respondents<br>(%)"),
    mean = md("Participation<br>(mean score)")
  ) %>%
  gtsave(filename = "tab_1.png")
Figures
There are many ways to create figures in R, but ggplot is the most popular.
dat %>%
  filter(add_participation < 15) %>%
  ggplot(aes(add_participation)) +
  geom_histogram(binwidth = 1)
dat %>%
  filter(add_participation < 15) %>%
  ggplot(aes(x = add_participation, fill = q3_baseline)) +
  geom_histogram(binwidth = 1, alpha = 0.5, position = "identity")
You can do tremendous amounts of customization with ggplot to create extremely informative and professional plots.
## demographics
demos = dat %>%
  drop_na(q3_baseline) %>%
  select(
    `Respondent Gender` = q3_baseline,
    `Work as a student?` = q4_baseline,
    `Rural or urban?` = q5_baseline,
    `Financial support from family?` = q6_baseline,
    `Home region` = home_region,
    `Student Year` = class_year
  ) %>%
  pivot_longer(everything()) %>%
  group_by(name, value) %>%
  tally() %>%
  mutate(pct = n / sum(n)) %>%
  top_n(n = 5, wt = pct)

demos$name = factor(demos$name, levels = c(
  'Home region', 'Rural or urban?', 'Student Year',
  'Respondent Gender', 'Work as a student?', 'Financial support from family?'
))

demos <- demos %>%
  group_by(name) %>%
  mutate(total_n = sum(n)) %>%
  ungroup()

ggplot(demos, aes(y = value, x = pct)) +
  geom_col(fill = "grey") +
  facet_wrap(~name, scales = "free") +
  scale_y_discrete(labels = scales::label_wrap(30)) +
  hrbrthemes::scale_x_percent(limits = c(0, 1)) +
  labs(
    x = "Percent of respondents to the survey",
    y = NULL,
    title = "Demographic characteristics of baseline respondents",
    subtitle = glue::glue("Number of respondents = {scales::comma(nrow(dat))}"),
    caption = "Note: home region only displays top five categories by size."
  ) +
  geom_text(
    aes(label = glue::glue("n = {total_n}")),
    x = 0.92, y = -Inf, vjust = -1, hjust = 1, inherit.aes = FALSE
  )
Regression
Now let’s use simple linear regression to estimate the relationship between the two variables that we created.
Question: Have you used regression in your research? If so, what software did you use to run it?
est = lm(add_participation ~ moved, dat)
summary(est)
Call:
lm(formula = add_participation ~ moved, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.788 -1.788 -0.788 0.609 33.609
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3911 0.1624 8.567 <2e-16 ***
moved 0.3967 0.2010 1.973 0.0488 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.673 on 778 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.00498, Adjusted R-squared: 0.003701
F-statistic: 3.894 on 1 and 778 DF, p-value: 0.04881
What if we want to control for the influence of the number of years someone has been at university? Here, we can create a new measure of year and use multiple regression.
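A sketch of what that multiple regression could look like, assuming we use the class_year variable seen in the demographics code above (the exact recoding of year may differ):

```r
# Control for year at university alongside our `moved` indicator
est2 = lm(add_participation ~ moved + factor(class_year), dat)
summary(est2)
```

Treating class_year as a factor lets each year have its own intercept rather than imposing a linear effect of year.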